ABSTRACT 

Analyses for results of a series of studies examining 
intercorrelations among a set of as many as p+1 variables are 
presented. Several estimators of a pooled or average correlation 
vector and its variances are derived for cases in which some studies 
do not report complete correlation matrices. A test of the 
homogeneity (consistency) of the correlation matrices is also given. 
Data from a synthesis of relationships among mathematical, verbal, 
and spatial ability measures illustrate the procedures. These data 
are taken from 10 samples (sample sizes 74, 153, 48, 55, 51, 18, 27, 
43, 35, and 34, respectively) from 4 studies exploring the 
relationship of spatial ability to Scholastic Aptitude Test scores 
for high school or junior high school students. The empirical Bayes 
procedure (based on the EM algorithm) involves no data loss and is 
recommended if it is reasonable to assume that the unobserved 
correlations are missing at random. Three tables present illustrative 
data. (Author/SLD) 
Missing Data in Correlation Matrices 
Abstract 

This paper outlines analyses for results of a series of studies examining 
intercorrelations among a set of as many as p+1 variables. Several estimators 
of a pooled or average correlation vector and its variance are derived for 
cases in which some studies do not report complete correlation matrices. A 
test of the homogeneity (consistency) of the correlation matrices is also 
given. Data from a synthesis of relationships among mathematical, verbal, and 
spatial ability measures illustrate the procedures. The empirical Bayes 
procedure (based on the EM algorithm) involves no data loss, and is 
recommended if it is reasonable to assume that the unobserved correlations are 
missing at random. 



Missing Data and the Synthesis of Correlation Matrices 
Many research syntheses which examine relationships in education and the 
social sciences examine one relationship or at most a few different bivariate 
relationships. In some research domains, however, series of studies may 
examine similar or identical collections or sets of variables. One example is 
the literature on the prediction of college grade-point average from entrance- 
examination scores and high-school records. In such cases it may be desirable 
to combine the correlation matrices among the variables common to a number of 
studies in order to draw general conclusions about the interrelationships 
among the variables. 

When all the studies under consideration share a common population matrix 
it is sensible to estimate a common (pooled) correlation matrix. In other 
situations it may be useful to estimate the average of a series of correlation 
matrices (and its variance). One problem which arises in attempting to pool 
or average correlation matrices from series of studies is that some studies 
may not have measured every variable of interest. Consequently some of the 
correlations of interest may not be observed in every study. 

The first section below presents notation and a model for the results of 
a series of studies examining intercorrelations among a set of p+1 variables. 
Several estimators of a pooled correlation matrix and its variance are derived 
for the case in which correlations may be unobserved in some studies. 
Estimators based on available-data and complete-case analyses, and on 
imputation of both unconditional and conditional means are described and 
critiqued. An empirical Bayes estimator is also provided for the case in which 
a random-effects model is assumed to underlie the series of studies. Data from 
a synthesis of relationships among mathematical, verbal, and spatial ability 
measures (Friedman, in press) are used to illustrate the procedures. 
Notation and Model 
Let $Y_1, \ldots, Y_p$ be random variables with the multivariate normal 
distribution. For example, $Y_1$ may be an outcome and $Y_2, \ldots, Y_p$ may be 
$p-1$ predictors. Consider the situation in which each of a series of $k$ studies 
has examined correlations among these same $p$ variables or a subset of those 
variables. The number of measured variables in the $i$th study will be denoted 
$p_i$ and the number of nonredundant correlations reported by study $i$ is 
$m_i = p_i(p_i-1)/2$. 

Consider first a study $i$ which has examined the intercorrelations among 
all $p$ variables (i.e., in which $p_i = p$). Let $r_{ist}$ and $\rho_{ist}$ be the sample and 
population correlations between $Y_s$ and $Y_t$ in the $i$th study and let 

$$\mathbf{r}_i = (r_{i12}, r_{i13}, \ldots, r_{i1p}, r_{i23}, \ldots, r_{i(p-1)p})'$$ 

and 

$$\boldsymbol{\rho}_i = (\rho_{i12}, \rho_{i13}, \ldots, \rho_{i1p}, \rho_{i23}, \ldots, \rho_{i(p-1)p})'$$ 

be the vectors of $m_i = p(p-1)/2 = p^*$ nonredundant sample and 
population correlations, respectively. When it is convenient to refer to the 
elements of the vectors $\mathbf{r}_i$ and $\boldsymbol{\rho}_i$ by sequential position, a Greek subscript 
$\alpha$ or $\gamma$ will be used (e.g., $r_{i\alpha}$ and $r_{i\gamma}$ are elements of $\mathbf{r}_i$). Thus 
$\mathbf{r}_i = (r_{i\alpha})$, where $\alpha$ runs over the range $\alpha = 1, \ldots, p^*$. 
Distribution of r 

Olkin and Siotani (1976) showed that if all correlations have been 
observed in study $i$, with a sample of size $n_i$, the asymptotic distribution of 
$\sqrt{n_i}\,(\mathbf{r}_i - \boldsymbol{\rho}_i)$ is normal with mean zero and variance-covariance matrix that 
depends on $\boldsymbol{\rho}_i$. This implies that in large samples, $\mathbf{r}_i$ is approximately 
normally distributed with mean vector $\boldsymbol{\rho}_i$ and variance-covariance matrix $\Sigma_i$, 
where the elements of $\Sigma_i$ are denoted $\sigma_{i\alpha\gamma}$, with 

$$\sigma_{i\alpha\alpha} = \mathrm{Var}(r_{i\alpha}) = (1 - \rho_{i\alpha}^2)^2 / n_i, \qquad (1)$$ 

and 

$$\sigma_{i\alpha\gamma} = \mathrm{Cov}(r_{i\alpha}, r_{i\gamma}).$$ 

A formula for $\sigma_{i\alpha\gamma}$ is given by Olkin and Siotani (1976, p. 238). The 
covariance can most easily be expressed by noting that if $r_{i\alpha} = r_{ist}$, the 
correlation between the $s$th and $t$th variables in study $i$; $r_{i\gamma} = r_{iuv}$; and $\rho_{ist}$ 
and $\rho_{iuv}$ are the corresponding population values, then 

$$\mathrm{Cov}(r_{ist}, r_{iuv}) = [\,0.5\,\rho_{ist}\rho_{iuv}(\rho_{isu}^2 + \rho_{isv}^2 + \rho_{itu}^2 + \rho_{itv}^2) + \rho_{isu}\rho_{itv} + \rho_{isv}\rho_{itu}$$ 
$$\qquad -\,(\rho_{ist}\rho_{isu}\rho_{isv} + \rho_{its}\rho_{itu}\rho_{itv} + \rho_{ius}\rho_{iut}\rho_{iuv} + \rho_{ivs}\rho_{ivt}\rho_{ivu})\,]/n_i. \qquad (2)$$ 

Typically $\sigma_{i\alpha\alpha}$ and $\sigma_{i\alpha\gamma}$ are estimated by substituting corresponding sample 
estimates for the parameters in (1) and (2). These estimates are denoted 
below as $\hat\sigma_{i\alpha\alpha}$ and $\hat\sigma_{i\alpha\gamma}$. 
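
Formulas (1) and (2) can be evaluated mechanically once a study's correlation matrix is available. The following is a minimal sketch, assuming NumPy; the function names `corr_cov` and `sampling_cov_matrix` and the 3 x 3 correlation values are hypothetical illustrations, not taken from the paper (only the sample size 74 echoes the first sample listed in the abstract).

```python
import itertools
import numpy as np

def corr_cov(R, n, s, t, u, v):
    """Large-sample Cov(r_st, r_uv) from formula (2).

    R is a p x p correlation matrix with unit diagonal; n is the sample size.
    Sample correlations are plugged in for the population values.
    """
    term1 = 0.5 * R[s, t] * R[u, v] * (
        R[s, u] ** 2 + R[s, v] ** 2 + R[t, u] ** 2 + R[t, v] ** 2)
    term2 = R[s, u] * R[t, v] + R[s, v] * R[t, u]
    term3 = (R[s, t] * R[s, u] * R[s, v] + R[t, s] * R[t, u] * R[t, v]
             + R[u, s] * R[u, t] * R[u, v] + R[v, s] * R[v, t] * R[v, u])
    return (term1 + term2 - term3) / n

def sampling_cov_matrix(R, n):
    """p* x p* matrix of variances and covariances of the nonredundant
    correlations, ordered (1,2), (1,3), ..., (p-1,p)."""
    R = np.asarray(R, dtype=float)
    pairs = list(itertools.combinations(range(R.shape[0]), 2))
    S = np.empty((len(pairs), len(pairs)))
    for a, (s, t) in enumerate(pairs):
        for g, (u, v) in enumerate(pairs):
            S[a, g] = corr_cov(R, n, s, t, u, v)
    return S

# Hypothetical correlations among three measures in one sample of size 74.
R_example = np.array([[1.0, 0.5, 0.3],
                      [0.5, 1.0, 0.4],
                      [0.3, 0.4, 1.0]])
S_example = sampling_cov_matrix(R_example, n=74)
```

When the two correlations coincide (s = u and t = v), formula (2) algebraically reduces to formula (1), so the diagonal of the resulting matrix carries the variances.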

Missing Correlations 

When a study has measured fewer than the $p$ variables of interest in the 
series of studies, $p_i$ is less than $p$. The vector $\mathbf{r}_i = (r_{i\alpha})$ for $\alpha = 1$ to $m_i$ 
would then have length $m_i < p^*$. For convenience, however, we will use the 
subscript $\alpha$ to represent the particular relationship measured by $r_{i\alpha}$ rather 
than the position of $r_{i\alpha}$ in the (shortened) vector $\mathbf{r}_i$. Thus every vector $\mathbf{r}_i$ 
will have length $p^*$, but for studies in which fewer than $p^*$ correlations have 
been observed $\mathbf{r}_i$ will contain $m_i$ observed correlations and $p^* - m_i$ unobserved 
values. The unobserved values will be identified via an indicator vector 
$\mathbf{m}_i = (m_{i\alpha})$, $\alpha = 1, \ldots, p^*$, where 

$$m_{i\alpha} = \begin{cases} 1, & \text{if } r_{i\alpha} \text{ is observed,} \\ 0, & \text{otherwise.} \end{cases}$$ 

If $\mathbf{m}_i = \mathbf{m}_j$ for studies $i$ and $j$, we say that studies $i$ and $j$ have the same 
missing data patterns. Also note that $\sum_\alpha m_{i\alpha} = m_i$, and denote $\sum_i m_i = m$. 

Missing covariances. When a study $i$ observes $m_i < p^*$ correlations, the 
covariance matrix $\Sigma_i$ defined by (1) and (2) will contain even fewer than 
$m_i(m_i-1)/2$ observed covariance values. A hidden consequence of missing 
correlations is that covariances between other reported correlations become 
impossible to compute. This results from the form of the covariance in (2). 
Thus, for instance, in study $i$ the correlation $r_{isv}$ is needed to compute the 
covariance between $r_{ist}$ and $r_{iuv}$. 

The values of covariances between observed correlations are indicated by 
the Hadamard product ($\star$) of the matrix $\mathbf{m}_i\mathbf{m}_i'$ with the full $p^* \times p^*$ matrix $\Sigma_i$, 
specifically $\mathbf{m}_i\mathbf{m}_i' \star \Sigma_i$. The matrix $\mathbf{m}_i\mathbf{m}_i'$ contains zeros in the positions of 
covariances between unobserved correlations, and ones for covariances between 
observed correlations. The Hadamard product thus shows covariance values 
where they are observed (or imputed), and full rows and columns of zero values 
elsewhere. To use the matrix $\mathbf{m}_i\mathbf{m}_i' \star \Sigma_i$ in computations involving matrix 
inversion, its dimension must be reduced from $p^* \times p^*$ to $m_i \times m_i$ by removing 
all columns and rows which are identically zero. This corresponds to ignoring 
the unobserved elements in the vector $\mathbf{r}_i$ and their associated variances and 
covariances. 
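
The masking and reduction just described can be sketched in a few lines. This is an illustrative sketch assuming NumPy; the function name `observed_block` and the numbers in the usage example are hypothetical. The reduced matrix is obtained by selecting the rows and columns flagged as observed, which gives the same result as deleting the all-zero rows and columns of the Hadamard product.

```python
import numpy as np

def observed_block(Sigma_i, m_i):
    """Return (masked, reduced) versions of a study's covariance matrix.

    Sigma_i : p* x p* covariance matrix of the correlations in study i
    m_i     : length-p* indicator vector (1 = observed, 0 = unobserved)
    """
    Sigma_i = np.asarray(Sigma_i, dtype=float)
    m_i = np.asarray(m_i, dtype=int)
    # Hadamard (elementwise) product of m_i m_i' with Sigma_i.
    masked = np.outer(m_i, m_i) * Sigma_i
    # Keep only rows and columns that belong to observed correlations.
    keep = m_i.astype(bool)
    reduced = Sigma_i[np.ix_(keep, keep)]
    return masked, reduced

# Hypothetical example: the second of three correlations is unobserved.
Sigma_demo = np.array([[1.0, 2.0, 3.0],
                       [2.0, 4.0, 5.0],
                       [3.0, 5.0, 6.0]])
masked_demo, reduced_demo = observed_block(Sigma_demo, [1, 0, 1])
```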

The literature on missing data from experiments and sample surveys offers 
little specific assistance in how to deal with missing covariances. One 
approach is to simply ignore potential dependencies between correlations for 
which covariances cannot be computed. That is, the covariances could be 
estimated as zero. However, since results are typically intercorrelated, 
this may lead to overweighting the results of studies which have not reported 
full correlation matrices. 

An ad hoc adjustment that might be made is to impute values for the 
missing correlations into the covariance formula (2) using one of the methods 
discussed below. Becker (1992) discussed two other ad hoc approaches to 
computing missing covariances. In one approach pooled correlations were 
substituted for subsample values in a study which had not reported complete 
correlation matrices for the two subsamples of interest. Another approach 
used patterns of correlations between tests at two times (before and after an 
intervention) to estimate the between-test correlations across time (e.g., the 
correlation of pretest A with posttest B). 
Results of Series of Studies 

The results of $k$ independent studies, each examining as many as $p^*$ 
correlations, can be expressed as the concatenation of the vectors $\mathbf{r}_1, \ldots, \mathbf{r}_k$, 
containing the nonredundant elements of the matrices of results of the $k$ 
studies. Let the $kp^* \times 1$ vectors of (observed and unobserved) sample and 
population correlations be denoted as 

$$\mathbf{r} = (\mathbf{r}_1', \ldots, \mathbf{r}_k')' \quad \text{and} \quad \boldsymbol{\rho} = (\boldsymbol{\rho}_1', \ldots, \boldsymbol{\rho}_k')',$$ 

respectively. If the sample sizes of the $k$ independent studies tend to 
infinity at the same rate (formally if $N = \sum_i n_i$ and if the $\pi_i = n_i/N$ for $i$ 
= 1 to $k$ remain fixed as $N \to \infty$) then $\sqrt{N}\,(\mathbf{r} - \boldsymbol{\rho})$ has a nondegenerate asymptotic 
distribution as $N \to \infty$. This leads to the large sample approximation that $\mathbf{r}$ is 
normally distributed about $\boldsymbol{\rho}$. The large sample variance-covariance matrix of 
$\mathbf{r}$ is then $\Sigma$, where $\Sigma$ is a blockwise diagonal matrix with submatrices $\Sigma_1$ 
through $\Sigma_k$, and $\Sigma_i$ is defined above. Specifically, 

$$\Sigma = \begin{pmatrix} \Sigma_1 & & \\ & \ddots & \\ & & \Sigma_k \end{pmatrix}. \qquad (3)$$ 



When some studies have not observed all correlations, we also require the 
concatenated vector of zeros and ones 

$$\mathbf{m} = (\mathbf{m}_1', \ldots, \mathbf{m}_k')'.$$ 

The total number of observed correlations is $\sum_i \sum_\alpha m_{i\alpha} = \sum_i m_i = m$. Also note 
that if $\sum_i m_{i\alpha} = k_\alpha = k$, then all studies have observed correlations for the 
$\alpha$th relationship. As above, the covariances among observed $r$s are indicated 
by the matrix $\mathbf{m}\mathbf{m}' \star \Sigma$. 

Estimating the Pooled Correlation Matrix 
When all of the studies share a common population correlation matrix, 
that is when $\boldsymbol{\rho}_1 = \ldots = \boldsymbol{\rho}_k$, it makes sense to pool estimates from the studies 
to estimate the common correlation matrix. In practice one would first test 
the hypothesis that all studies arise from a single population, then estimate 
either a pooled (common) or average correlation matrix. Procedures for 
estimation and testing of the pooled matrix for the case in which all 
correlations are observed (from Becker, in press) are repeated here for 
convenience of notation. Results for the incomplete-data case are given in 
the next section. 
Notation and Model 

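To estimate a common correlation vector of length $p^*$ the generalized 
least squares (GLS) model is 

$$\mathbf{r} = X\boldsymbol{\rho}_\cdot + \mathbf{e},$$ 

where $\mathbf{r}$ is the vector of $kp^*$ correlation coefficients, $\boldsymbol{\rho}_\cdot = (\rho_{\cdot 1}, \ldots, \rho_{\cdot p^*})'$ is 
the set of common correlations to be estimated, and $X$ is a $kp^* \times p^*$ matrix 
created by "stacking" $k$ identity matrices, each of dimension $p^* \times p^*$. If $k$ = 
10 and $p^*$ = 3 (as in the examples which follow), $X$ would be the 30 x 3 matrix 

$$X = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ \vdots & \vdots & \vdots \\ 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$ 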



Under the assumptions of the GLS model, the error vector $\mathbf{e} = \mathbf{r} - X\boldsymbol{\rho}_\cdot$ has mean 
zero and approximate covariance matrix $\Sigma$. The estimate of the pooled 
correlation matrix is then simply the usual GLS estimate of the regression 
coefficients, here, 

$$\mathbf{r}_\cdot = (X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}\mathbf{r} \qquad (4)$$ 

with approximate variance-covariance matrix given by 

$$V = (X'\Sigma^{-1}X)^{-1}. \qquad (5)$$ 

Typically both $\mathbf{r}_\cdot$ and $V$ are computed using an estimated variance matrix $\hat\Sigma$ in 
place of $\Sigma$. When the large-sample normality of the vector $\mathbf{r}$ is justified, $\mathbf{r}_\cdot$ 
can also be assumed normal, and standard inferential procedures (e.g., 
confidence intervals, tests of significance about the elements of $\boldsymbol{\rho}_\cdot$) are 
possible. 
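
Equations (4) and (5) translate directly into matrix code. The sketch below assumes NumPy, a complete data vector, and an already estimated block-diagonal covariance matrix; the function name `gls_pool` and the toy numbers are hypothetical and serve only to show the mechanics.

```python
import numpy as np

def gls_pool(r, Sigma, p_star):
    """GLS pooled correlation vector (4) and its covariance matrix (5).

    r      : length k*p* vector of stacked correlations (all observed)
    Sigma  : (k*p*) x (k*p*) block-diagonal covariance matrix of r
    p_star : number of nonredundant correlations per study
    """
    r = np.asarray(r, dtype=float)
    k = r.size // p_star
    # Design matrix: k identity matrices of order p* stacked vertically.
    X = np.tile(np.eye(p_star), (k, 1))
    Sigma_inv = np.linalg.inv(np.asarray(Sigma, dtype=float))
    V = np.linalg.inv(X.T @ Sigma_inv @ X)   # equation (5)
    r_dot = V @ X.T @ Sigma_inv @ r          # equation (4)
    return r_dot, V

# Toy example with k = 2 studies and p* = 2 correlations per study.
r_toy = np.array([0.2, 0.4, 0.4, 0.6])
Sigma_toy = np.diag([0.01, 0.02, 0.01, 0.02])
r_dot_toy, V_toy = gls_pool(r_toy, Sigma_toy, p_star=2)
```

With equal variances across the two toy studies and no covariances, the pooled values are simple averages, which gives a quick sanity check on the implementation.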

Test of Homogeneity 

Becker (in press) also presents a test for homogeneity of correlation 
matrices, similar to that derived by Hedges and Olkin (1985). The test of the 
hypothesis of homogeneity of correlation matrices, that is, the test of 

$$H_0\!: \boldsymbol{\rho}_1 = \ldots = \boldsymbol{\rho}_k,$$ 

uses the statistic 

$$Q = \mathbf{r}'[\Sigma^{-1} - \Sigma^{-1}X(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}]\,\mathbf{r}.$$ 

When $H_0$ is true $Q$ has approximately a chi-square distribution with 
$kp^* - p^*$ degrees of freedom. Thus a test of $H_0$ at the $100\alpha$ percent level of 
significance is given by rejecting $H_0$ if $Q$ exceeds the $100(1-\alpha)$ percentile 
point of the chi-square distribution with $(k-1)p^*$ degrees of freedom. 
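
As an illustration of the homogeneity statistic, the following sketch (assuming NumPy; the function name `homogeneity_q` and the toy inputs are hypothetical) computes Q and its degrees of freedom for a complete set of studies. The resulting Q would then be compared with a tabled chi-square percentile.

```python
import numpy as np

def homogeneity_q(r, X, Sigma):
    """Homogeneity statistic Q and its degrees of freedom.

    Q = r' [Sigma^-1 - Sigma^-1 X (X' Sigma^-1 X)^-1 X' Sigma^-1] r,
    with (number of rows of X) - (number of columns of X) degrees of freedom.
    """
    r = np.asarray(r, dtype=float)
    Sigma_inv = np.linalg.inv(np.asarray(Sigma, dtype=float))
    V = np.linalg.inv(X.T @ Sigma_inv @ X)
    middle = Sigma_inv - Sigma_inv @ X @ V @ X.T @ Sigma_inv
    Q = float(r @ middle @ r)
    df = X.shape[0] - X.shape[1]
    return Q, df

# Toy example: k = 2 studies, p* = 2 correlations each.
X_toy = np.tile(np.eye(2), (2, 1))
Q_toy, df_toy = homogeneity_q([0.2, 0.4, 0.4, 0.6],
                              X_toy,
                              np.diag([0.01, 0.02, 0.01, 0.02]))
```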



Estimation when some Correlations are Unobserved 



Complete-case Analysis 



The complete-case analysis approach to missing data suggests that 
parameters be estimated using those cases (here, studies) which report 
complete data. The drawback of this approach is that the data loss can be 
great if more than a few studies fail to report all correlations. Because 
exact replications are discouraged by journal editors and avoided by 
researchers, this problem is likely to be encountered in syntheses of most 
research domains in the social sciences. The complete-case analysis is not 
the analysis of choice in most situations. However, it may be useful in 
providing estimates to use, for example, in computing missing covariances. 

The complete-case analysis would involve the application of GLS 
estimation methods to the set of results of studies reporting all 
correlations. Denote the number of studies which report complete correlation 
matrices as $k_c$. To estimate a common correlation vector of length $p^*$ the 
model is 

$$\mathbf{r}_c = X_c\boldsymbol{\rho}_\cdot + \mathbf{e}_c,$$ 

where $\mathbf{r}_c$ is a concatenation of the vectors $\mathbf{r}_i$ (like $\mathbf{r}$ above) but includes only 
the results of the $k_c$ samples with complete matrices, $\boldsymbol{\rho}_\cdot = (\rho_{\cdot 1}, \ldots, \rho_{\cdot p^*})'$ is 
the set of common correlations to be estimated, and $X_c$ is a $k_c p^* \times p^*$ matrix 
created by "stacking" $k_c$ identity matrices, each of dimension $p^* \times p^*$. The 
variance matrix for the vector $\mathbf{r}_c$ is denoted $\Sigma_c$, and contains the matrices $\Sigma_i$ 
for the samples with complete results. 

The estimates of the pooled correlation matrix and its variance are then 
simply the usual GLS estimates, computed using $\mathbf{r}_c$, $X_c$, and $\Sigma_c$ in place of $\mathbf{r}$, 
$X$, and $\Sigma$, respectively, in (4) and (5). 
Available-cases Analysis 

One relatively simple approach to handling missing data in multivariate 
analysis is to use "available-cases" analysis (Little & Rubin, 1987, p. 41). A 
number of different estimators are possible within this framework. The 
estimate given here is essentially based on pairwise available cases. 

The available-cases estimate presented here is a generalization of the 
pooled correlation matrix estimated via generalized least squares (GLS) shown 
above in (4) and (5). However, since $kp^* - m$ of the possible correlations in $\mathbf{r}$ 
are unobserved, we omit all rows of $\mathbf{r}$ and $X$ that represent unobserved 
correlations. We denote the reduced vector and matrix as $\mathbf{r}_o$ and $X_o$. Both $\mathbf{r}_o$ 
and $X_o$ then contain $m$ rows. 

Also we reduce the dimension of $\Sigma$ (or $\hat\Sigma$) as described above (omitting 
rows and columns of all zeros), and denote the new $m \times m$ covariance matrix as 
$\Sigma_o$. As noted above, it is also necessary either to assume that the missing 
covariances between reported $r$ values (which require unobserved $r$ values to be 
computed) equal zero, or to estimate those covariances using other values for 
the missing $r$s. 

We rewrite the GLS model as 

$$\mathbf{r}_o = X_o\boldsymbol{\rho}_\cdot + \mathbf{e}_o.$$ 

In the new model the error vector $\mathbf{e}_o = \mathbf{r}_o - X_o\boldsymbol{\rho}_\cdot$ has approximate 
covariance matrix $\Sigma_o$, so the GLS estimate of $\boldsymbol{\rho}_\cdot$ is 

$$\mathbf{r}_\cdot = (X_o'\Sigma_o^{-1}X_o)^{-1}X_o'\Sigma_o^{-1}\mathbf{r}_o \qquad (6)$$ 

with approximate variance-covariance matrix given by 

$$V = (X_o'\Sigma_o^{-1}X_o)^{-1}. \qquad (7)$$ 

For the available-case analysis the test of the hypothesis of homogeneity of 
correlation matrices uses the statistic 

$$Q = \mathbf{r}_o'[\Sigma_o^{-1} - \Sigma_o^{-1}X_o(X_o'\Sigma_o^{-1}X_o)^{-1}X_o'\Sigma_o^{-1}]\,\mathbf{r}_o.$$ 

When $H_0$ (given above) is true $Q$ has approximately a chi-square distribution 
with $m - p^*$ degrees of freedom. Thus a test of $H_0$ at the $100\alpha$ percent level 
of significance is given by rejecting $H_0$ if $Q$ exceeds the $100(1-\alpha)$ percentile 
point of the chi-square distribution with $m - p^*$ degrees of freedom. 
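
The available-cases computations in (6) and (7) amount to dropping the unobserved rows before applying GLS. A minimal sketch follows, assuming NumPy, that unobserved entries of the data vector are stored as NaN, and that every relationship is observed in at least one study (otherwise the matrix in (7) is singular). The function name `available_cases` and the toy values are hypothetical. Q is computed here as the weighted sum of squared residuals, which is algebraically equal to the matrix form of the statistic given above.

```python
import numpy as np

def available_cases(r, Sigma, m, p_star):
    """Available-cases GLS estimate (6), its covariance (7), Q, and df."""
    m = np.asarray(m).astype(bool)          # True where r is observed
    k = m.size // p_star
    X = np.tile(np.eye(p_star), (k, 1))
    # Drop rows (and matching columns of Sigma) for unobserved correlations.
    r_o = np.asarray(r, dtype=float)[m]
    X_o = X[m]
    Sigma_o = np.asarray(Sigma, dtype=float)[np.ix_(m, m)]
    Sigma_o_inv = np.linalg.inv(Sigma_o)
    V = np.linalg.inv(X_o.T @ Sigma_o_inv @ X_o)     # equation (7)
    r_dot = V @ X_o.T @ Sigma_o_inv @ r_o            # equation (6)
    resid = r_o - X_o @ r_dot
    Q = float(resid @ Sigma_o_inv @ resid)
    df = int(m.sum()) - p_star
    return r_dot, V, Q, df

# Toy example: study 2 did not report the second correlation.
# The 1.0 on the diagonal is only a placeholder for the unobserved entry.
r_ac, V_ac, Q_ac, df_ac = available_cases(
    r=[0.2, 0.4, 0.4, np.nan],
    Sigma=np.diag([0.01, 0.02, 0.01, 1.0]),
    m=[1, 1, 1, 0],
    p_star=2)
```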
Imputing Unconditional Means 

Using this approach we would substitute for each unreported $r$ value the 
appropriate mean estimated by the available-case analysis, then proceed (e.g., 
with GLS estimation) as if the data were complete. Specifically, if the value 
$r_{i\alpha}$ is not reported in study $i$, one would substitute the mean $r_{\cdot\alpha}$ computed 
using (6) above. Although one typically obtains reasonable average values 
using this approach, variances and covariances are systematically 
underestimated because the imputed values by definition lie near the center of 
the distribution of the observed correlations (Little & Rubin, 1987). In 
meta-analysis this implies that values of the homogeneity test statistic can be 
reduced when mean values are substituted. 

Two complications which arise with this approach involve questions of 
homogeneity and the precision of the predicted correlations (the unconditional 
means). Unlike the two approaches described above, this approach involves 
substituting particular values for the missing correlations, and using them as 
if they had been reported by the studies as actual data. Thus it is important 
to ask whether the means that are imputed are "reasonable" values. 

Because the unobserved correlations are unavailable for comparison it is 
impossible to really gauge whether the substituted mean values are 
appropriate. However, one indication of the representativeness of the means 
for the studies from which they are obtained is the test of homogeneity. Thus 
failure to reject the hypothesis of homogeneity for the complete cases 
suggests that the mean values are good measures of the relationships of 
interest in all of the studies which reported them. 

The second question which arises when imputed values are used in the GLS 
estimation framework is how their sampling variances and covariances should be 
computed. The formulas (1) and (2) assume that the correlation values are all 
computed for the same sample. Substitution of other values into these 
formulas can lead to correlations between $r$s that are out of range and to 
within-study covariance matrices that are not positive definite. Standard 
imputation procedures for missing data in experiments suggest a number of 
adjustments for the general underestimation of sample variances; however, in 
those cases all missing values on a particular variable are homoscedastic, 
which is not generally the case in meta-analysis. 

Ad hoc estimates of sampling variability were used in the present 
analyses. The variance of the imputed correlation $r_{\cdot\alpha}$ in study $i$ was computed 
as 

$$\hat\sigma_{i\alpha\alpha} + v_{\alpha\alpha}, \qquad (8)$$ 

where $\hat\sigma_{i\alpha\alpha}$ is the estimated sampling variance in (1), computed using the mean 
$r_{\cdot\alpha}$ and $n_i$, and $v_{\alpha\alpha}$ is the variance of the mean $r_{\cdot\alpha}$ from the available-case 
analysis (i.e., the $\alpha$th diagonal element of $V$ in (7) above). 
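
The ad hoc variance in (8) is a sum of two terms and can be written as a one-line function. The sketch below uses only plain Python; the function name `imputed_mean_variance` and the input values are hypothetical.

```python
def imputed_mean_variance(r_dot_alpha, n_i, v_alpha_alpha):
    """Ad hoc variance (8) for an imputed unconditional mean.

    r_dot_alpha   : available-cases mean for relationship alpha, from (6)
    n_i           : sample size of the study receiving the imputed value
    v_alpha_alpha : variance of that mean, the matching diagonal entry of (7)
    """
    sampling_var = (1.0 - r_dot_alpha ** 2) ** 2 / n_i   # formula (1) at the mean
    return sampling_var + v_alpha_alpha

# Hypothetical values: mean correlation 0.5 imputed into a study with n = 75.
var_demo = imputed_mean_variance(0.5, 75, 0.002)
```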

The rationale for (8) is based on the theory for the estimation of a 
value of a response variable $Y^*$ from a predicted value $\hat{Y}^*$ in linear regression 
(see Seber, 1977, Sec. 5.3). In that simpler case, Seber noted that 
$\mathrm{var}[Y^* - \hat{Y}^*] = \sigma^2(v^* + 1)$, where $\sigma^2$ is the variance of $Y^*$ and $v^*\sigma^2$ is the variance of 
the predicted score based on a particular set of predictor values ($\mathbf{x}^*$). The 
confidence interval for $Y^*$ is computed using an estimate of the standard error 
$\sigma(v^* + 1)^{1/2}$. The analogue in the present context is to use the estimated 
variance of the predicted correlation (i.e., the mean) in place of $\mathrm{var}(\hat{Y}^*)$ and 
the sampling variance computed from (1) in place of $\mathrm{var}(Y^*)$. 

Covariances involving $r_{\cdot\alpha}$ were computed as $\mathrm{Cov}(r_{\cdot\alpha}, r_{i\gamma})$, for $\alpha \neq \gamma$, using 
formula (2). However, the question of how best to estimate both variances and 
covariances involving imputed values requires further investigation. 
Imputing Conditional Means 

This method, proposed by Buck (1960), involves estimating the unreported 
correlations from a prediction model based on the complete cases. Typically 
(in practice) several regression models would be estimated, one for each 
variable (i.e., correlation) with unreported values. The unreported values 
are estimated case by case for each variable. 

In the multivariate meta-analysis context the predictors of a correlation 
with unreported values would be the other observed correlations. For 
instance, consider a case in which some values of $r_{i\alpha}$ are unreported but all 
other correlations are completely reported. To predict missing values of the 
$\alpha$th correlation in study $j$ one would regress the reported values of $r_{i\alpha}$ on 
values of $r_{i1}, \ldots, r_{i(\alpha-1)}, r_{i(\alpha+1)}, \ldots, r_{ip^*}$, then use the reported correlations 
in study $j$ to predict $r_{j\alpha}$. Weighted least squares regression should be used 
in the meta-analysis context (see, e.g., Hedges, 1983), weighting each value 
of $r_{i\alpha}$ by the inverse of its variance. The drawback of this approach is, 
again, that estimation may be difficult if few studies provide complete 
correlation matrices. 
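
The regression step can be sketched as a weighted least squares fit on the complete studies followed by a prediction for the incomplete study. This is an illustrative sketch assuming NumPy and a single unreported relationship; the function name `impute_conditional_mean` and all numbers are hypothetical. It needs at least as many complete studies as regression coefficients.

```python
import numpy as np

def impute_conditional_mean(R_complete, var_target, target, r_partial):
    """Predict an unreported correlation from the other correlations.

    R_complete : (k_c x p*) array of correlations from complete studies
    var_target : length-k_c sampling variances of the target correlation
    target     : column index of the relationship with unreported values
    r_partial  : length-p* vector for the incomplete study (target entry unused)
    """
    R_complete = np.asarray(R_complete, dtype=float)
    others = [c for c in range(R_complete.shape[1]) if c != target]
    # Predictors: an intercept plus the other reported correlations.
    X = np.column_stack([np.ones(len(R_complete)), R_complete[:, others]])
    y = R_complete[:, target]
    W = np.diag(1.0 / np.asarray(var_target, dtype=float))  # inverse-variance weights
    beta = np.linalg.solve(X.T @ W @ X, X.T @ W @ y)        # WLS slopes
    x_star = np.concatenate([[1.0], np.asarray(r_partial, dtype=float)[others]])
    return float(x_star @ beta)

# Hypothetical complete studies in which the third correlation happens to be
# an exact linear function of the first two (0.1 + 0.5*r1 + 0.2*r2).
R_c = np.array([[0.2, 0.3, 0.26],
                [0.4, 0.1, 0.32],
                [0.6, 0.5, 0.50],
                [0.3, 0.6, 0.37]])
r_hat = impute_conditional_mean(R_c, [0.01, 0.02, 0.015, 0.03], 2,
                                [0.5, 0.4, np.nan])
```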
The analysis then proceeds as if the data were complete, with the 
predicted values in place of the unobserved correlations. Also, because the 
unobserved values are predicted from the other data, their variances and 
covariances are again not given by $\sigma_{i\alpha\gamma}$. Below another ad hoc variance 
estimate is used (following the same rationale as above). Specifically, the 
value 

$$\hat\sigma_{i\alpha\alpha} + v^*_{i\alpha} \qquad (9)$$ 

is used, where $\hat\sigma_{i\alpha\alpha}$ is the sampling variance (1) computed using the imputed 
value of $r_{i\alpha}$, $v^*_{i\alpha}$ is a quadratic form in $\mathbf{x}^*_i$, and 
$\mathbf{x}^*_i = (1, r_{i1}, \ldots, r_{i(\alpha-1)}, r_{i(\alpha+1)}, \ldots, r_{ip^*})'$ is 
the vector of "predictor values" for the $i$th study. The value $v^*_{i\alpha}$ is an 
approximate variance of the predicted $r$ value. Note, however, that this 
variance does not account for the fact that the predictors are themselves 
random variables, measured with uncertainty. Similarly, the regression slope 
estimates treat the predictors as though they are known. A more appropriate 
but more complex analysis could treat the observed correlations as regressors 
measured with error (e.g., Seber, 1977, sec. 6.4). Covariances are computed 
as $\mathrm{Cov}(\hat{r}_{i\alpha}, r_{i\gamma})$ using formula (2). 

Empirical Bayes Estimation 

In some situations it may be more reasonable to expect the patterns of 
intercorrelations among a set of variables to differ between studies. 
Population correlations might be expected to vary if a variety of subject 
groups had been studied. Even if variation in the pattern of correlations is 
not expected, the test of homogeneity may suggest that the population 
correlation matrices differ. 

When the population correlation matrices vary, a random-effects model may 
be appropriate for the data. If we are willing to treat the distribution of 
the population correlations as a prior distribution for the data, we can use 
empirical Bayes estimation techniques to estimate the mean correlation vector 
and its variance. 
Random-effects Model 

Consider again the large sample distribution of the correlation vector 
$\mathbf{r}_i$. The result that $\mathbf{r}_i$ is approximately normal with a mean $\boldsymbol{\rho}_i$ implies that we 
can write the vector $\mathbf{r}_i$ in terms of a parameter $\boldsymbol{\rho}_i$ plus a vector of errors, 
say $\mathbf{e}_i$. That is, 

$$\mathbf{r}_i = \boldsymbol{\rho}_i + \mathbf{e}_i, \qquad (10)$$ 

and $\mathbf{e}_i$ is then distributed approximately normally with a mean of 0 and 
variance $\Sigma_i$, for $i = 1, \ldots, k$, where the elements of $\Sigma_i$ are given by (1) and 
(2) above. 

In the random-effects case we assume further that each vector of 
parameters $\boldsymbol{\rho}_i$ is composed of a common component $\boldsymbol{\rho}_\cdot = (\rho_{\cdot\alpha})$ for $\alpha = 1, \ldots, p^*$ 
plus a residual vector, say, $\mathbf{u}_i$. Specifically, 

$$\boldsymbol{\rho}_i = \boldsymbol{\rho}_\cdot + \mathbf{u}_i, \qquad (11)$$ 

for $i = 1, \ldots, k$. The variation represented by $\mathbf{u}_i$ is parameter variation, 
rather than sampling variation, which is represented by the error term $\mathbf{e}_i$ 
above. That is, we assume that the vectors of population correlations vary 
randomly about a common mean (which we wish to estimate). We denote the 
matrix of parameter variances as $T = (\tau_{\alpha\gamma})$, $\alpha, \gamma = 1$ to $p^*$. 
Estimation 

The estimation of $\boldsymbol{\rho}_\cdot$ and $T$ can be accomplished via the EM algorithm 
(Dempster, Laird, & Rubin, 1977; Dempster, Rubin, & Tsutakawa, 1981), an 
iterative procedure. Implementation of the algorithm in the present context 
involves, first, imputation of the conditional means of the missing data 
values (given the observed correlations) as described above. The observed and 
imputed values are then treated as complete data, and initial estimates of the 
mean and variance component ($\mathbf{r}_\cdot^{(0)}$ and $T^{(0)}$) are obtained. 

The estimated mean and variance are next treated as a Bayesian prior for 
the observed (and imputed) correlations. Weighted estimates of the vectors $\boldsymbol{\rho}_i$ 
are then computed, as are their standard errors. The cycle begins again as 
these "study-parameter" estimates are used to re-estimate the means and their 
standard errors. The iteration between these two procedures continues until 
the estimates of the mean vector and the parameter variances do not change 
materially with added iterations (i.e., until the maximum of the likelihood 
function is attained). 

In some situations implementation of the EM algorithm can be 
computationally intensive. However, the computations in the present case are 
relatively straightforward. The appendix gives a program written using SAS 
PROC MATRIX which accomplishes the computations outlined below. 

Posterior distribution of $\boldsymbol{\rho}$. The estimation of the mean vector $\boldsymbol{\rho}_\cdot$ and 
the variance-covariance matrix $T$ requires the posterior distribution of the 
vector of study parameters $\rho_{11}$ through $\rho_{kp^*}$ (i.e., $\boldsymbol{\rho}$). From model (10) above 
and the distribution of the sample correlation in (1) and (2) we know that for 
large samples the within-study sampling error ($e_{i\alpha}$) is normally distributed. 
Since 

$$r_{i\alpha} = \rho_{i\alpha} + e_{i\alpha}, \quad \alpha = 1 \text{ to } p^* \text{ and } i = 1 \text{ to } k,$$ 

then if $N = \sum_i n_i$ and $\pi_i = n_i/N$ remain fixed as $N$ approaches $\infty$, we can write 

$$\sqrt{N}\,(\mathbf{r} - \boldsymbol{\rho}) \sim N(\mathbf{0}, \Sigma^*),$$ 

where $\Sigma^*$ is defined via $\Sigma^* = N\Sigma$, and the elements of $\Sigma$ are given by (1) and 
(2). Thus the approximate density of the vector $\mathbf{r}$ conditional on the vector 
of study-parameters $\boldsymbol{\rho}$ is given by 

$$f(\mathbf{r} \mid \boldsymbol{\rho}) = |\Sigma|^{-1/2}(2\pi)^{-kp^*/2}\exp\{-\tfrac{1}{2}(\mathbf{r} - \boldsymbol{\rho})\Sigma^{-1}(\mathbf{r} - \boldsymbol{\rho})'\}.$$ 

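The second-stage model (11) shows the population correlation vector for 
each study varying around the mean population correlation for the $\alpha$th 
relationship, across studies. In terms of individual correlations, we write 

$$\rho_{i\alpha} = \rho_{\cdot\alpha} + u_{i\alpha}, \quad \text{for } \alpha = 1 \text{ to } p^* \text{ and } i = 1 \text{ to } k.$$ 

We define $\tau_{\alpha\gamma} = \mathrm{Cov}(\rho_{i\alpha}, \rho_{i\gamma})$ for $i = 1$ to $k$ and $\alpha, \gamma = 1$ to $p^*$. If we are 
willing to assume that the study-parameters $\rho_{i\alpha}$ are normally distributed about 
the means $\rho_{\cdot\alpha}$, then we can write the density of the vector of study-parameters 
$\boldsymbol{\rho}$ as 

$$f(\boldsymbol{\rho}) = |T|^{-1/2}(2\pi)^{-kp^*/2}\exp\{-\tfrac{1}{2}(\boldsymbol{\rho} - \boldsymbol{\rho}_\cdot)T^{-1}(\boldsymbol{\rho} - \boldsymbol{\rho}_\cdot)'\},$$ 

where $T$ is a $kp^* \times kp^*$ blockwise diagonal matrix containing $k$ ($p^* \times p^*$) blocks 
of $\tau_{\alpha\gamma}$ values and $\boldsymbol{\rho}_\cdot$ is a $kp^* \times 1$ vector defined as 

$$\boldsymbol{\rho}_\cdot = (\rho_{\cdot 1}, \rho_{\cdot 2}, \ldots, \rho_{\cdot p^*}, \ldots, \rho_{\cdot 1}, \rho_{\cdot 2}, \ldots, \rho_{\cdot p^*}).$$ 

That is, $\boldsymbol{\rho}_\cdot$ is a concatenation of $k$ sets of the average population 
correlations $\rho_{\cdot 1}$ through $\rho_{\cdot p^*}$. This slight variation on the notation used above 
gives $\boldsymbol{\rho}_\cdot$ the same dimension as $\boldsymbol{\rho}$, the vector of study parameters. 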
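The posterior distribution of $\boldsymbol{\rho}$ given $\mathbf{r}$ is then 

$$f(\boldsymbol{\rho} \mid \mathbf{r}) \propto f(\mathbf{r} \mid \boldsymbol{\rho})\,f(\boldsymbol{\rho}) = |\Sigma T|^{-1/2}(2\pi)^{-kp^*}\exp\{-\tfrac{1}{2}(\mathbf{r} - \boldsymbol{\rho})\Sigma^{-1}(\mathbf{r} - \boldsymbol{\rho})'\}$$ 
$$\qquad \times \exp\{-\tfrac{1}{2}(\boldsymbol{\rho} - \boldsymbol{\rho}_\cdot)T^{-1}(\boldsymbol{\rho} - \boldsymbol{\rho}_\cdot)'\}$$ 
$$\propto \exp\{-\tfrac{1}{2}[(\mathbf{r} - \boldsymbol{\rho})\Sigma^{-1}(\mathbf{r} - \boldsymbol{\rho})' + (\boldsymbol{\rho} - \boldsymbol{\rho}_\cdot)T^{-1}(\boldsymbol{\rho} - \boldsymbol{\rho}_\cdot)']\}. \qquad (12)$$ 

By expanding the quadratic forms in (12) and eliminating terms that do not 
depend on $\boldsymbol{\rho}$ we obtain 

$$f(\boldsymbol{\rho} \mid \mathbf{r}) \propto \exp\{-\tfrac{1}{2}[\boldsymbol{\rho}\Sigma^{-1}\boldsymbol{\rho}' - 2\boldsymbol{\rho}\Sigma^{-1}\mathbf{r}' + \boldsymbol{\rho}T^{-1}\boldsymbol{\rho}' - 2\boldsymbol{\rho}T^{-1}\boldsymbol{\rho}_\cdot']\}$$ 
$$= \exp\{-\tfrac{1}{2}[\boldsymbol{\rho}(\Sigma^{-1} + T^{-1})\boldsymbol{\rho}' - 2\boldsymbol{\rho}(\Sigma^{-1}\mathbf{r}' + T^{-1}\boldsymbol{\rho}_\cdot')]\}. \qquad (13)$$ 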

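We next define the matrices 

$$\Psi = (\Sigma^{-1} + T^{-1})^{-1}$$ 

and 

$$\boldsymbol{\rho}_1' = (\Sigma^{-1} + T^{-1})^{-1}(\Sigma^{-1}\mathbf{r}' + T^{-1}\boldsymbol{\rho}_\cdot') = \Psi(\Sigma^{-1}\mathbf{r}' + T^{-1}\boldsymbol{\rho}_\cdot').$$ 

Note that although $\boldsymbol{\rho}_1$ is a one-dimensional vector, we will denote its elements 
as $\rho_{1i\alpha}$ in order to identify the study and relationship associated with each 
element. The elements of $\boldsymbol{\rho}_1$ are thus arrayed as ($\rho_{111}, \rho_{112}, \ldots, \rho_{11p^*}, \ldots, 
\rho_{1k1}, \rho_{1k2}, \ldots, \rho_{1kp^*}$). 

Next multiply (13) by the term $\exp\{-\tfrac{1}{2}\boldsymbol{\rho}_1\Psi^{-1}\boldsymbol{\rho}_1'\}$, which is independent of 
$\boldsymbol{\rho}$. Substituting $\Psi^{-1}$ for $(\Sigma^{-1} + T^{-1})$ and $\Psi^{-1}\boldsymbol{\rho}_1'$ for $(\Sigma^{-1}\mathbf{r}' + T^{-1}\boldsymbol{\rho}_\cdot')$ produces 

$$f(\boldsymbol{\rho} \mid \mathbf{r}) \propto \exp\{-\tfrac{1}{2}[\boldsymbol{\rho}\Psi^{-1}\boldsymbol{\rho}' - 2\boldsymbol{\rho}\Psi^{-1}\boldsymbol{\rho}_1' + \boldsymbol{\rho}_1\Psi^{-1}\boldsymbol{\rho}_1']\}$$ 
$$= \exp\{-\tfrac{1}{2}(\boldsymbol{\rho} - \boldsymbol{\rho}_1)\Psi^{-1}(\boldsymbol{\rho} - \boldsymbol{\rho}_1)'\}, \qquad (14)$$ 

which is the kernel of the multivariate normal distribution. Thus the 
posterior distribution of $\boldsymbol{\rho}$ (given $\mathbf{r}$) is normal with mean $\boldsymbol{\rho}_1$ and variance $\Psi$. 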
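
Because both the sampling covariance matrix and the parameter covariance matrix are blockwise diagonal, the posterior mean and variance in (14) can be computed one study at a time. The following is a minimal sketch assuming NumPy and an invertible between-study covariance block; the function name `posterior_study` and the one-dimensional toy values are hypothetical.

```python
import numpy as np

def posterior_study(r_i, Sigma_i, rho_dot, T_block):
    """Posterior mean and variance of one study's correlation vector.

    Psi_i  = (Sigma_i^-1 + T^-1)^-1
    rho1_i = Psi_i (Sigma_i^-1 r_i + T^-1 rho_dot)
    T_block is the p* x p* between-study covariance block (must be invertible).
    """
    Sigma_inv = np.linalg.inv(np.asarray(Sigma_i, dtype=float))
    T_inv = np.linalg.inv(np.asarray(T_block, dtype=float))
    Psi_i = np.linalg.inv(Sigma_inv + T_inv)
    rho1_i = Psi_i @ (Sigma_inv @ np.asarray(r_i, dtype=float)
                      + T_inv @ np.asarray(rho_dot, dtype=float))
    return rho1_i, Psi_i

# One-dimensional toy case: equal sampling and parameter variances, so the
# posterior mean falls halfway between the observed value and the prior mean.
rho1_demo, Psi_demo = posterior_study([2.0], [[1.0]], [0.0], [[1.0]])
```

The posterior mean is a precision-weighted compromise between the observed correlations and the mean vector, which is why studies with small sampling variances are pulled less toward the mean.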

EM algorithm. The EM algorithm makes use of the distribution defined by 
(14) in the E or expectation step of the process. The EM approach requires 
initial estimates of $T$ and $\rho_{\cdot 1}$ through $\rho_{\cdot p^*}$ (the average correlations). Because 
these are starting values, simple estimators are typically all that is needed. 
The starting values $T^{(0)}$ and $\boldsymbol{\rho}_\cdot^{(0)} = \mathbf{r}_\cdot$ are used to compute the posterior mean 
of $\boldsymbol{\rho}$ and its variance, that is, $\boldsymbol{\rho}_1^{(0)}$ and $\Psi^{(0)}$. New estimates of $T$ and $\boldsymbol{\rho}_\cdot$ 
(i.e., $T^{(1)}$ and $\boldsymbol{\rho}_\cdot^{(1)}$) are then computed based on the sufficient statistics 
from the $\boldsymbol{\rho}_1^{(0)}$ values. (The specific forms of the estimates are given below.) 
The cycle continues until the likelihood in (14) is maximized, or practically 
speaking, until the differences between parameter estimates from one iteration 
to the next are small. 

Starting values. For starting values we use weighted method-of-moments 
estimators $\hat\rho_{\cdot\alpha}^{(0)}$ for $\alpha = 1$ to $p^*$ and $\hat{T}^{(0)} = (\hat\tau_{\alpha\gamma}^{(0)})$ for $\alpha, \gamma = 1$ to $p^*$, 
specifically, 

$$\hat\rho_{\cdot\alpha}^{(0)} = r_{\cdot\alpha} = \sum_i w_{i\alpha}\,r_{i\alpha},$$ 

and 

for 

where $w_{i\alpha\gamma} = (w_{i\alpha}w_{i\gamma})^{1/2}$ is a weight associated with correlations $r_{i\alpha}$ and $r_{i\gamma}$, 
$w_{i\alpha} = [1/\sigma_{i\alpha\alpha}]/\sum_s[1/\sigma_{s\alpha\alpha}]$ (that is, $w_{i\alpha}$ is the usual inverse-variance weight used 
in univariate meta-analyses), and where the values of $\sigma_{i\alpha\alpha}$ are given by (1) 
and (2). These estimators are superscripted with the index zero to indicate 
that they are starting values. 

When the amount of variation in the sample correlations for the $\alpha$th 
relationship is quite small the variance estimate $\hat\tau_{\alpha\alpha}^{(0)}$ can frequently be 
negative. By convention, negative values are set to zero, as would be any 
other covariance estimates involving the $\alpha$th relationship (i.e., values of 
$\hat\tau_{\alpha\gamma}^{(0)}$ for that value of $\alpha$). 
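
The starting value for each mean correlation is an inverse-variance weighted average across studies. The sketch below, assuming NumPy and complete (observed or imputed) data, illustrates only that mean; the function name `starting_means` and the toy numbers are hypothetical, and the starting value for the variance component is not sketched here.

```python
import numpy as np

def starting_means(r, sigma_diag):
    """Inverse-variance weighted starting values for the mean correlations.

    r          : (k x p*) array of correlations, one row per study
    sigma_diag : (k x p*) array of their sampling variances from (1)
    Within each relationship (column), the weights sum to one across studies.
    """
    r = np.asarray(r, dtype=float)
    precision = 1.0 / np.asarray(sigma_diag, dtype=float)
    w = precision / precision.sum(axis=0)     # weights w_{i alpha}
    return (w * r).sum(axis=0)

# Toy example: two studies, one relationship; the first study is three times
# as precise as the second, so it receives weight 0.75.
start_demo = starting_means([[0.2], [0.4]], [[0.01], [0.03]])
```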

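Expectation step. The posterior distribution of the values $\rho_{11}, \ldots, 
\rho_{1p^*}, \ldots, \rho_{k1}, \ldots, \rho_{kp^*}$ (given the data) is then used to obtain estimates of 
the study parameters. These are essentially weighted combinations of the 
original data (the $r$s) and the starting values of the mean correlations $\rho_{\cdot 1}$ 
through $\rho_{\cdot p^*}$. We compute 

$$\boldsymbol{\rho}_1^{(0)\prime} = \Psi^{(0)}(\Sigma^{-1}\mathbf{r}' + [T^{(0)}]^{-1}\boldsymbol{\rho}_\cdot^{(0)\prime})$$ 

and 

$$\Psi^{(0)} = (\Sigma^{-1} + [T^{(0)}]^{-1})^{-1}.$$ 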



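Maximization step. In this step new estimates of $T$ (i.e., $T^{(t+1)}$) and the 
mean correlation vector are obtained from the sufficient statistics for the 
study-parameter estimates. The estimates of the elements of $T$ and $\boldsymbol{\rho}_\cdot$ on 
iteration $(t + 1)$ are given by 

$$\hat\tau_{\alpha\gamma}^{(t+1)} = \Big[\sum_i \{\psi_{i\alpha\gamma}^{(t)} + (\rho_{1i\alpha}^{(t)} - \hat\rho_{\cdot\alpha}^{(t+1)})(\rho_{1i\gamma}^{(t)} - \hat\rho_{\cdot\gamma}^{(t+1)})\}\Big]/k$$ 

and 

$$\hat\rho_{\cdot\alpha}^{(t+1)} = \Big[\sum_i \rho_{1i\alpha}^{(t)}\Big]/k, \quad \text{for } \alpha, \gamma = 1 \text{ to } p^*,$$ 

where $\psi_{i\alpha\gamma}^{(t)}$ is an element of the matrix $\Psi^{(t)}$, $\rho_{1i\alpha}^{(t)}$ is an element of $\boldsymbol{\rho}_1^{(t)}$, and 

$$\boldsymbol{\rho}_1^{(t)\prime} = \Psi^{(t)}(\Sigma^{-1}\mathbf{r}' + [T^{(t)}]^{-1}\boldsymbol{\rho}_\cdot^{(t)\prime}).$$ 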

Iteration. The process of estimation and maximization is repeated until
the likelihood function is maximized, that is, until the parameter estimates
(e.g., the estimates of $T$ and $\rho_{\cdot 1}$ through $\rho_{\cdot p^*}$)
do not change much from one iteration to the next. Note, however, that the
program given in the appendix stops after iterating for a fixed number of
cycles rather than stopping after a convergence criterion has been met.
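The cycle of expectation and maximization steps can be sketched in Python as below. This is not the appendix program. The update for $T$ is an assumption: it is the standard EM update for a normal model with known sampling covariance matrices (cross-products of the posterior means about their average, plus the posterior variances, divided by $k$). The function name and array layout are hypothetical, and the sketch assumes complete (or already imputed) data and a positive-definite starting value for $T$.

```python
import numpy as np

def em_correlations(r, sigmas, rho0, T0, n_iter=200):
    """Run a fixed number of EM cycles.

    r      : (k, p) array of sample correlation vectors r_i
    sigmas : (k, p, p) array of known sampling covariance matrices Sigma_i
    rho0   : (p,) starting mean correlation vector
    T0     : (p, p) positive-definite starting value for T
    """
    r = np.asarray(r, dtype=float)
    sigmas = np.asarray(sigmas, dtype=float)
    rho = np.asarray(rho0, dtype=float)
    T = np.asarray(T0, dtype=float)
    k = r.shape[0]
    for _ in range(n_iter):  # fixed number of cycles, no convergence test
        # Expectation step: posterior mean and variance for each sample.
        T_inv = np.linalg.inv(T)
        post_means = np.empty_like(r)
        post_vars = np.empty_like(sigmas)
        for i in range(k):
            S_inv = np.linalg.inv(sigmas[i])
            Phi = np.linalg.inv(S_inv + T_inv)
            post_vars[i] = Phi
            post_means[i] = Phi @ (S_inv @ r[i] + T_inv @ rho)
        # Maximization step (assumed standard form, see lead-in).
        rho = post_means.mean(axis=0)
        dev = post_means - rho
        T = (dev.T @ dev + post_vars.sum(axis=0)) / k
    return rho, T, post_means
```

A convergence criterion could replace the fixed loop by comparing `rho` and `T` between successive cycles; the fixed count is kept here to mirror the description of the appendix program.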

Missing data. The EM algorithm can be applied when all correlations have
been observed or when some correlations are missing. When data are missing at
random (that is, when the reason that correlations are unobserved is unrelated
to the actual values of the unobserved correlations) then it is possible to
get maximum likelihood estimates by ignoring the missing-data mechanism.
Little and Rubin (1987, Chapter 8) discuss this problem in detail for
multivariate normal examples with unknown covariance matrices. The present
case is similar, but involves normal data with a known covariance matrix.

Application of Little and Rubin's methodology for handling missing data
adds only one step to the estimation process described above. Before
obtaining starting values $T^{(0)}$ and $\rho_\cdot^{(0)}$ we must impute
values of the unobserved correlations.

The value imputed is the expected value of the missing correlation,
conditional on the observed data. In practice this means substituting the
best estimate of $r_{i\alpha}$ available, based on the observed data. If
correlation $r_{i\alpha}$ is missing, one would use the regression method for
imputing conditional means to predict a value $\hat{r}_{i\alpha}$ for study i,
as described above. When some studies are missing more than one correlation,
missing values would be estimated for each pattern of missing data, using an
approach similar to that described in Little and Rubin's (1987) Chapter 6. The
imputed values are then substituted into the data set and analysis via the EM
algorithm proceeds as if the data were complete.
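The regression method for imputing a conditional mean can be sketched as a weighted least squares fit on the complete samples, followed by a prediction for the incomplete sample. The function below is an illustration under assumptions: its name, arguments, and the use of `numpy.linalg.lstsq` on rescaled rows are choices made here, not the authors' program.

```python
import numpy as np

def impute_conditional_mean(y_obs, X_obs, weights, x_new):
    """Predict a missing correlation from a weighted regression.

    y_obs   : (m,) values of the target correlation in the complete samples
    X_obs   : (m, q) the other correlations in those same samples
    weights : (m,) regression weights (e.g., inverse variances of y_obs)
    x_new   : (q,) the observed correlations in the incomplete sample
    """
    y = np.asarray(y_obs, dtype=float)
    X = np.column_stack([np.ones(len(y)), np.asarray(X_obs, dtype=float)])
    sw = np.sqrt(np.asarray(weights, dtype=float))
    # Weighted least squares via ordinary least squares on rescaled rows.
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    prediction = beta[0] + np.asarray(x_new, dtype=float) @ beta[1:]
    return prediction, beta
```

With several missing-data patterns, one such regression would be fit per pattern, each using only the predictors that are observed in that pattern.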



Example 

Data 

Data for the example are from ten samples in four studies which explored
the relationship of spatial ability to SAT scores for high-school or
junior-high students. The ten samples from these studies are drawn from a more
extensive synthesis of sex differences in the relations among math, spatial,
and verbal ability measures by Friedman (in press). This example considers
correlations among measures of at most three variables from each sample (i.e.,
$p + 1 = 3$), as shown in Table 1. We have omitted the correlations between
SAT-M and spatial ability reported for the two samples from Rosenberg (1981)
to create an example with less than complete data. Correlations and sample
sizes are shown in Table 2.

Insert Tables 1 and 2 about here

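In the ith study, the correlations among the three variables are
represented in our notation as:

            Math        Spatial     Verbal
Math        1.0         $r_{i1}$    $r_{i2}$
Spatial     $r_{i1}$    1.0         $r_{i3}$
Verbal      $r_{i2}$    $r_{i3}$    1.0

Writing these correlations as a vector $r_i$, the relationships represented are

$$r_i = \begin{pmatrix} r_{i1} \\ r_{i2} \\ r_{i3} \end{pmatrix}
\qquad
\begin{matrix} \text{Math-Spatial} \\ \text{Math-Verbal} \\ \text{Spatial-Verbal} \end{matrix}$$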



The $r_i$ vectors for four of the ten samples in the example are

$$r_1 = \begin{pmatrix} .47 \\ -.21 \\ -.15 \end{pmatrix}, \quad
r_2 = \begin{pmatrix} .28 \\ .19 \\ .18 \end{pmatrix}, \quad
r_5 = \begin{pmatrix} r_{51} \\ .48 \\ .23 \end{pmatrix}, \quad
r_6 = \begin{pmatrix} r_{61} \\ .74 \\ .44 \end{pmatrix}.$$

Vectors $r_1$ and $r_2$ from Becker (1978) represent complete correlation
matrices. However, in $r_5$ and $r_6$ from Rosenberg (1981), the elements
$r_{51}$ and $r_{61}$ are not observed in our example data.

Each vector of correlations has an associated limiting variance-covariance
matrix, computed using (1) and (2) above. The limiting variance-covariance
matrices for the Becker (1978) samples are

$$\Sigma_1 = \begin{pmatrix}
 .0082 & -.0010 & -.0018 \\
-.0010 &  .0123 &  .0058 \\
-.0018 &  .0058 &  .0129
\end{pmatrix}
\quad \text{and} \quad
\Sigma_2 = \begin{pmatrix}
.0056 & .0009 & .0010 \\
.0009 & .0061 & .0016 \\
.0010 & .0016 & .0061
\end{pmatrix}.$$
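The entries of these matrices can be checked numerically. Formulas (1) and (2) are not reproduced in this section, so the sketch below assumes the standard large-sample variance and covariance of correlations that share a variable; under that assumption it reproduces the Becker matrices to the four decimals reported (with negative covariances for the first Becker sample). The function name is hypothetical and the code is written for three variables only.

```python
import numpy as np

def sigma_three_variables(r, n):
    """Large-sample covariance matrix of the three correlations among
    three variables, ordered as in the text (e.g., Math-Spatial,
    Math-Verbal, Spatial-Verbal). Assumed standard formulas:
      var(r_a)      = (1 - r_a^2)^2 / n
      cov(r_a, r_b) = [r_c (1 - r_a^2 - r_b^2)
                       - 0.5 r_a r_b (1 - r_a^2 - r_b^2 - r_c^2)] / n
    where r_c is the remaining correlation."""
    r = np.asarray(r, dtype=float)
    S = np.empty((3, 3))
    for a in range(3):
        S[a, a] = (1.0 - r[a] ** 2) ** 2 / n
        for b in range(a + 1, 3):
            c = 3 - a - b  # index of the third correlation
            cov = (r[c] * (1 - r[a] ** 2 - r[b] ** 2)
                   - 0.5 * r[a] * r[b]
                   * (1 - r[a] ** 2 - r[b] ** 2 - r[c] ** 2)) / n
            S[a, b] = S[b, a] = cov
    return S

sigma_1 = sigma_three_variables([.47, -.21, -.15], 74)
sigma_2 = sigma_three_variables([.28, .19, .18], 153)
```

The same function applied to a sample with an imputed first correlation would give the imputed variances and covariances for that sample.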



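The two matrices for the samples from Rosenberg (1981) are, respectively,

$$\Sigma_5 = \begin{pmatrix}
\sigma_{511} & \sigma_{512} & \sigma_{513} \\
\sigma_{521} & .0116 & \hat{\sigma}_{523} \\
\sigma_{531} & \hat{\sigma}_{532} & .0176
\end{pmatrix}
\quad \text{and} \quad
\Sigma_6 = \begin{pmatrix}
\sigma_{611} & \sigma_{612} & \sigma_{613} \\
\sigma_{621} & .0114 & \hat{\sigma}_{623} \\
\sigma_{631} & \hat{\sigma}_{632} & .0361
\end{pmatrix}.$$

The covariances of the reported correlations from Rosenberg (the off-diagonal
elements) must also be imputed because values of $r_{51}$ and $r_{61}$,
respectively, are needed in their computation.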

The vector of all correlations to be synthesized then is

$$r = (.47, -.21, -.15, .28, .19, .18, .48, .41, .26, .37, .40, .27, r_{51}, .48, .23,$$
$$r_{61}, .74, .44, .26, .72, .36, .32, .52, .10, .58, .64, .40, .34, .28, -.03)'$$

and its variance $\Sigma$ is the 30 x 30 blockwise diagonal matrix comprised of
$\Sigma_1$ through $\Sigma_{10}$.
Complete-case Analysis

Complete data from eight of the ten samples (i.e., from all studies
except Rosenberg (1981)) are used to estimate $\rho_\cdot$ and $V$. The GLS
estimate of the mean correlation vector is

$$(.367, .202, .421)',$$

with variance-covariance matrix

$$\begin{pmatrix}
.0014 & .0002 & .0005 \\
.0002 & .0014 & .0005 \\
.0005 & .0005 & .0017
\end{pmatrix}.$$


The test of homogeneity for the complete-case analysis is Q = 62.09,
which under the null hypothesis of homogeneity is a chi-square with (8 - 1)3 or
21 degrees of freedom. The value of Q is larger than the upper-tail
$\alpha = .05$ critical value for 21 degrees of freedom, suggesting that the
eight samples do not share a single population matrix. Thus although we have
used the data above to estimate a pooled correlation matrix, the
interpretation of that matrix as a shared or common population matrix seems
unwarranted.
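The pooled estimate, its variance, and the homogeneity statistic can be sketched with the usual generalized least squares expressions. Formulas (6) and (7) are not reproduced in this section, so the forms below are assumptions: the GLS mean $(X'\Sigma^{-1}X)^{-1}X'\Sigma^{-1}r$, its variance $(X'\Sigma^{-1}X)^{-1}$, and Q as the weighted residual sum of squares with degrees of freedom equal to the number of correlations minus the number of pooled parameters. The function name is hypothetical.

```python
import numpy as np

def gls_pool(r, Sigma, X):
    """GLS pooled mean, its covariance matrix, and the homogeneity test.

    r     : (m,) stacked vector of observed correlations
    Sigma : (m, m) block-diagonal covariance matrix of r
    X     : (m, p) indicator matrix linking each r to a pooled correlation
    """
    r = np.asarray(r, dtype=float)
    X = np.asarray(X, dtype=float)
    Sigma_inv = np.linalg.inv(np.asarray(Sigma, dtype=float))
    V = np.linalg.inv(X.T @ Sigma_inv @ X)   # variance of the pooled mean
    mean = V @ X.T @ Sigma_inv @ r           # pooled correlation vector
    resid = r - X @ mean
    Q = float(resid @ Sigma_inv @ resid)     # homogeneity statistic
    df = len(r) - X.shape[1]
    return mean, V, Q, df
```

For the complete-case analysis, `r` would hold the 24 correlations from the eight complete samples and `X` would be eight stacked 3 x 3 identity matrices, giving 24 - 3 = 21 degrees of freedom.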
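Available-case Analysis

In Friedman's data $m = 28$, so two elements of r and two rows of X are
eliminated. Thus the reduced vector r and the reduced matrix X are

Sample   Relationship      r      Row of X
  1      Math-Spatial      .47    1 0 0
  1      Math-Verbal      -.21    0 1 0
  1      Spatial-Verbal   -.15    0 0 1
  2      Math-Spatial      .28    1 0 0
  2      Math-Verbal       .19    0 1 0
  2      Spatial-Verbal    .18    0 0 1
  3      Math-Spatial      .48    1 0 0
  3      Math-Verbal       .41    0 1 0
  3      Spatial-Verbal    .26    0 0 1
  4      Math-Spatial      .37    1 0 0
  4      Math-Verbal       .40    0 1 0
  4      Spatial-Verbal    .27    0 0 1
  5      Math-Verbal       .48    0 1 0
  5      Spatial-Verbal    .23    0 0 1
  6      Math-Verbal       .74    0 1 0
  6      Spatial-Verbal    .44    0 0 1
  7      Math-Spatial      .26    1 0 0
  7      Math-Verbal       .72    0 1 0
  7      Spatial-Verbal    .36    0 0 1
  8      Math-Spatial      .32    1 0 0
  8      Math-Verbal       .52    0 1 0
  8      Spatial-Verbal    .10    0 0 1
  9      Math-Spatial      .58    1 0 0
  9      Math-Verbal       .64    0 1 0
  9      Spatial-Verbal    .40    0 0 1
 10      Math-Spatial      .34    1 0 0
 10      Math-Verbal       .28    0 1 0
 10      Spatial-Verbal   -.03    0 0 1

The rows for samples 5 and 6 represent the Rosenberg results. Note that every
row of X for Rosenberg shows a zero in column one (i.e., the column for the
first element of the pooled correlation matrix). In order to estimate the
covariances between the two reported values for the Rosenberg samples,
$r_{\cdot 1} = .367$ from the complete-case analysis was used as the value of
$r_{i1}$.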
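The reduced X matrix can be built mechanically from the pattern of reported correlations. The sketch below is an illustration; the function name and the boolean `observed` array are assumptions introduced here. It stacks one identity matrix per sample and then drops the rows that correspond to unreported correlations.

```python
import numpy as np

def available_case_design(observed):
    """observed: (k, p) boolean array, True where a correlation is reported.
    Returns the indicator matrix X with one row per reported correlation."""
    observed = np.asarray(observed, dtype=bool)
    k, p = observed.shape
    X_full = np.tile(np.eye(p), (k, 1))   # k stacked p x p identity matrices
    return X_full[observed.ravel()]       # keep rows for reported values only

# Ten samples, three correlations each; samples 5 and 6 (the Rosenberg
# samples) do not report the first (Math-Spatial) correlation.
observed = np.ones((10, 3), dtype=bool)
observed[[4, 5], 0] = False
X = available_case_design(observed)
```

The resulting matrix has 28 rows, and the four Rosenberg rows all have a zero in column one.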

The estimate of the pooled correlation matrix from the available-case
analysis, using (6) above, is

$$r_\cdot = (.374, .437, .227)',$$

with variance-covariance matrix computed from formula (7) as

$$\begin{pmatrix}
.0014 & .0001 & .0004 \\
.0001 & .0011 & .0004 \\
.0004 & .0004 & .0015
\end{pmatrix}.$$

The overall test for homogeneity is Q = 82.12 (df = 25, p < .0005). The
ten samples do not seem to share a common correlation matrix. Thus again the
interpretation of $r_\cdot$ as a shared or common population matrix seems
unwarranted.

Imputing Unconditional Means 

In this analysis mean values were substituted for the two missing
correlations. For our example, the value $r_{\cdot 1} = .373$ was substituted
for $r_{51}$ and $r_{61}$. The variance of the mean ($V_{11} = 0.0014$) was
added to the computed values of $\sigma_{511}$ and $\sigma_{611}$. The GLS
estimate based on all 10 samples, including the imputed data points, was the
mean vector

$$(.363, .463, .228)'$$

with variance-covariance matrix

$$\begin{pmatrix}
.0012 & .0001 & .0005 \\
.0001 & .0011 & .0005 \\
.0005 & .0005 & .0015
\end{pmatrix}.$$

The estimated variance of the first element of the correlation vector has only
decreased from .0014 (in the available-case analysis) to .0012. This
corresponds to a standard error that is roughly six percent smaller (i.e.,
.035 versus .037), which would have only a small impact on inferential
procedures. The homogeneity test value of 73.17 is significant when compared
to the $\alpha = .05$ upper-tail critical value of the chi-square distribution
with $(10 - 1)p^* = 27$ degrees of freedom.
Imputing Conditional Means 

In the imputation of conditional means we again estimate the missing
values using data from the eight cases which report all three correlations.
The weighted regression of $r_{i1}$ on $r_{i2}$ and $r_{i3}$ (weighting by the
inverse of the variance of each $r_{i1}$ value) for the eight samples with
complete data gives the weighted regression model

$$\hat{r}_{i1} = 0.388 - 0.024 r_{i2} + 0.062 r_{i3}.$$

This model predicts values of $\hat{r}_{51} = .391$ and $\hat{r}_{61} = .398$.
Our example data do not illustrate the potential advantages of this procedure
well because $r_{i1}$ is essentially unrelated to $r_{i2}$ and $r_{i3}$.
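The two predicted values can be checked by direct substitution. The sketch below takes the intercept as 0.388, the value consistent with the reported predictions of .391 and .398; the function name is hypothetical.

```python
def predict_r1(r2, r3, intercept=0.388, b2=-0.024, b3=0.062):
    """Predicted Math-Spatial correlation from the other two correlations,
    using the weighted regression coefficients quoted in the text."""
    return intercept + b2 * r2 + b3 * r3

# Rosenberg samples: (Math-Verbal, Spatial-Verbal) = (.48, .23) and (.74, .44)
r51_hat = predict_r1(0.48, 0.23)   # 0.388 - 0.01152 + 0.01426
r61_hat = predict_r1(0.74, 0.44)   # 0.388 - 0.01776 + 0.02728
```

Both predictions stay within about .01 of the intercept, which is the numerical face of the remark that the first correlation is essentially unrelated to the other two.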

Because the two unobserved values have been predicted from the other
data, their variances are computed as

$$\hat{\sigma}_{511} + V^*_{511} = .0141 + .0025 = .0166,$$

and

$$\hat{\sigma}_{611} + V^*_{611} = .0393 + .0058 = .0451.$$

Covariances involving $\hat{r}_{i1}$ were computed using $\hat{r}_{i1}$ in
place of $r_{i1}$ in formula (2).

For our data, the estimated mean correlation vector (using GLS
estimation) is (.367, .461, .224)', with variance-covariance matrix

$$\begin{pmatrix}
.0012 & .0001 & .0005 \\
.0001 & .0011 & .0004 \\
.0005 & .0004 & .0015
\end{pmatrix}.$$

The test of homogeneity for this analysis is Q = 71.99, which is approximately
distributed as a chi-square variable with 27 degrees of freedom. As above,
the hypothesis of homogeneity is rejected.
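For reference, the four homogeneity tests reported so far can be set beside their critical values. The Q statistics and degrees of freedom are those given in the text; the critical values are the standard upper-tail $\alpha = .05$ chi-square table values, quoted here to two decimals.

```latex
\begin{array}{lrrr}
\text{Analysis} & Q & df & \chi^2_{.95}(df) \\
\hline
\text{Complete-case}                 & 62.09 & 21 & 32.67 \\
\text{Available-case}                & 82.12 & 25 & 37.65 \\
\text{Imputing unconditional means}  & 73.17 & 27 & 40.11 \\
\text{Imputing conditional means}    & 71.99 & 27 & 40.11
\end{array}
```

In every analysis Q is well above the critical value, so the conclusion of heterogeneity does not depend on how the two missing correlations are handled.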
Empirical Bayes Estimates 

Next the mean vector and its variance-covariance matrix were estimated
via the EM algorithm. We first imputed the values $\hat{r}_{51} = .391$ and
$\hat{r}_{61} = .398$ (with estimated variances .0166 and .0451,
respectively), using the method of imputing conditional means described above.

The starting values for the correlation vector and its variance-covariance
matrix were

$$\rho^{(0)} = (.39, .42, .21)'$$

and

$$T^{(0)} = \begin{pmatrix}
 .0006 & -.0005 & -.0054 \\
-.0005 &  .0723 & -.0396 \\
-.0054 & -.0396 &  .0146
\end{pmatrix}.$$


After 600 iterations the values of the mean correlations and their variance 
estimates were changing by less than 10"^, The estimated mean vector was 

$$(.393, .424, .226)'$$

with variance-covariance matrix

$$T^{(600)} = \begin{pmatrix}
.0004 & .0006 & .0001 \\
.0006 & .0619 & .0323 \\
.0001 & .0323 & .0170
\end{pmatrix}.$$

The parameters representing the relationship of SAT-M with SAT-V (the
$\rho_{i2}$ values) showed the most variability, with a standard error of
nearly 0.25 (i.e., the square root of the diagonal element .0619). The
correlations between SAT-V and spatial ability also showed considerable
parameter variation, with a standard error of 0.13.
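The two figures quoted above are the square roots of the second and third diagonal elements of the estimated matrix. The first diagonal element is included below for comparison; the arithmetic is simply:

```latex
\sqrt{.0004} = .020, \qquad
\sqrt{.0619} \approx .249, \qquad
\sqrt{.0170} \approx .130 .
```

The SAT-M with spatial ability parameters therefore vary by roughly a tenth as much as the SAT-M with SAT-V parameters.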

The empirical Bayes estimates of the individual study parameters after
600 iterations are shown in Table 3. These values can be compared to the
original sample correlations. The minimal amount of variation in the SAT-M
spatial ability correlations has led to very similar estimates of $\rho_{i1}$
for the ten samples. Values of $r_{i2}$, which showed considerable
variability, produced more dispersed values of $\rho_{i2}$.



Insert Table 3 about here 



Conclusions 

Missing or unreported study results are an impediment to thorough reviews 
of any research literature. The problem of unreported correlation values is 
pervasive in research reviews which attempt to synthesize results of complete 
correlation matrices, especially matrices which involve more than a few 
variables. The methods reported here, particularly the empirical Bayes 
estimation procedures, should enable researchers to accomplish reasonable 
initial analyses in situations wherein the unreported values appear to be 
missing at random or simply not studied. Further work is needed to explore
cases in which the missing-data mechanism is more complicated (e.g., involving
truncation) and in which the data are unlikely to be missing at random.
Table 1

Variables Measured in Example Studies

                        Measures
Study       Math    Verbal   Spatial ability
Becker      SAT-M   SAT-V    Differential Aptitude Tests: Space Relations
Berry       SAT-M   SAT-V    Thurstone and Jeffrey Concealed Figures Test
Rosenberg   SAT-M   SAT-V    Differential Aptitude Tests: Space Relations
Weiner      SAT-M   SAT-V    Group Embedded Figures Test

Table 2

Sample Sizes and Correlations for Example Data

                                                    Correlations
                                           SAT-M      SAT-M     SAT-V
Sample id   Sample               Sample size   Spatial    SAT-V     Spatial
    1       Becker 1 (1978)      n1 = 74       .47        -.21      -.15
    2       Becker 2 (1978)      n2 = 153      .28         .19       .18
    3       Berry 1 (1957)       n3 = 48       .48         .41       .26
    4       Berry 2 (1957)       n4 = 55       .37         .40       .27
    5       Rosenberg 1 (1980)   n5 = 51       r51         .48       .23
    6       Rosenberg 2 (1980)   n6 = 18       r61         .74       .44
    7       Weiner 1 (1984)      n7 = 27       .26         .72       .36
    8       Weiner 2 (1984)      n8 = 43       .32         .52       .10
    9       Weiner 3 (1984)      n9 = 35       .58         .64       .40
   10       Weiner 4 (1984)      n10 = 34      .34         .28      -.03

Table 3

Population Correlations for Example Data Estimated Using EM Algorithm

                                           Correlations
                                  rho_i1     rho_i2     rho_i3
                                  SAT-M      SAT-M      SAT-V
Sample id   Sample                Spatial    SAT-V      Spatial
    1       Becker 1 (1978)       ...        ...        ...
    2       Becker 2 (1978)       .381       .244       .136
    3       Berry 1 (1957)        .396       .402       .214
    4       Berry 2 (1957)        .391       .418       .224
    5       Rosenberg 1 (1980)    .394       .469       .249
    6       Rosenberg 2 (1980)    .394       .706       .374
    7       Weiner 1 (1984)       .392       .722       .383
    8       Weiner 2 (1984)       .395       .493       .261
    9       Weiner 3 (1984)       .402       .583       .306
   10       Weiner 4 (1984)       .393       .303       .162

Note

1. For instance, Becker (1992) found correlations among rs ranging from small
negative to large positive values in a synthesis of predictors of science
achievement. Correlations were as large as .82 between rs which represented
similar relationships.